Developing a Dataset for Evaluating Approaches for Document Expansion with Images
Abstract
Motivated by the adage that “a picture is worth a thousand words”, it can be reasoned that automatically enriching the textual content of a document with relevant images can increase its readability. Moreover, features extracted from the inserted images may, in principle, also be used by a retrieval engine to better match the topic of a document with that of a given query. In this paper, we describe our approach to building a ground truth dataset to enable further research into the automatic addition of relevant images to text documents. The dataset comprises three parts: the official ImageCLEF 2010 collection (a collection of images with textual metadata), which serves as the pool of images available for automatic enrichment of text; a set of 25 benchmark documents to be enriched, which in this case are children’s short stories; and a set of manually judged relevant images for each query story, obtained by the standard procedure of depth pooling. We use this benchmark dataset to evaluate the effectiveness of standard information retrieval methods as simple baselines for this task. The results indicate that using the whole story as a weighted query, where the weight of each query term is its tf-idf value, achieves a precision of 0.1714 within the top 5 retrieved images on average.
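The baseline described in the abstract (the whole story as a tf-idf weighted query, scored by precision within the top 5 images) can be illustrated with a minimal sketch. The abstract does not state the exact idf formula, the retrieval model, or the tokenization, so the `log(N / df)` weighting, the sum-of-matching-weights scoring, and the whitespace tokenizer below are assumptions, and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stopword list.
    return text.lower().split()


def tfidf_query(story, doc_freq, num_docs):
    """Weight each story term by tf * log(N / df); terms absent from the collection are dropped."""
    tf = Counter(tokenize(story))
    return {
        term: freq * math.log(num_docs / doc_freq[term])
        for term, freq in tf.items()
        if doc_freq.get(term, 0) > 0
    }


def rank_images(query_weights, image_metadata):
    """Score each image by summing the weights of query terms present in its metadata."""
    scores = {}
    for image_id, text in image_metadata.items():
        terms = set(tokenize(text))
        scores[image_id] = sum(w for t, w in query_weights.items() if t in terms)
    # Sort by descending score; break ties by image id for a deterministic order.
    return sorted(scores, key=lambda image_id: (-scores[image_id], image_id))


def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k ranked images that were judged relevant."""
    return sum(1 for image_id in ranked[:k] if image_id in relevant) / k


# Toy stand-in for an image collection with textual metadata (hypothetical data).
image_metadata = {
    "img1": "red fox in forest",
    "img2": "city skyline at night",
    "img3": "fox cub sleeping",
    "img4": "forest path in autumn",
    "img5": "old bridge over river",
    "img6": "night market stall",
}

# Document frequency of each term over the metadata collection.
doc_freq = Counter(t for text in image_metadata.values() for t in set(tokenize(text)))

story = "the fox ran through the forest and the fox hid"
relevant = {"img1", "img4"}  # hypothetical manual relevance judgments for this story

weights = tfidf_query(story, doc_freq, len(image_metadata))
ranked = rank_images(weights, image_metadata)
p_at_5 = precision_at_k(ranked, relevant, k=5)
print(ranked[:5], p_at_5)
```

A figure such as the reported 0.1714 would correspond to the mean of this per-story value over all 25 query stories.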
Publication date: 2016